-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OASIS-12] Add trace exporter to Datadog exporter #25759
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #25759 +/- ##
===========================================
+ Coverage 44.85% 51.24% +6.38%
===========================================
Files 2354 1782 -572
Lines 273223 152885 -120338
===========================================
- Hits 122559 78344 -44215
+ Misses 140994 70094 -70900
+ Partials 9670 4447 -5223
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv create-vm --pipeline-id=37232273 --os-family=ubuntu Note: This applies to commit d082ad9 |
Regression DetectorRegression Detector ResultsRun ID: 687e1164-0f32-4ce1-9a0e-64eddd4555d3 Metrics dashboard Target profiles Baseline: 4bddf58 Performance changes are noted in the perf column of each table:
No significant changes in experiment optimization goalsConfidence level: 90.00% There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
|
perf | experiment | goal | Δ mean % | Δ mean % CI | links |
---|---|---|---|---|---|
➖ | basic_py_check | % cpu utilization | +1.74 | [-0.93, +4.42] | Logs |
➖ | file_tree | memory utilization | +1.17 | [+1.06, +1.27] | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.73 | [-11.96, +13.43] | Logs |
➖ | otel_to_otel_logs | ingress throughput | +0.15 | [-0.66, +0.97] | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | Logs |
➖ | idle | memory utilization | -0.20 | [-0.24, -0.16] | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -1.78 | [-2.65, -0.90] | Logs |
➖ | pycheck_1000_100byte_tags | % cpu utilization | -2.80 | [-7.47, +1.88] | Logs |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
61854d4
to
8e86f44
Compare
8e86f44
to
0bb3e7d
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
e8e9eeb
to
f4bf27e
Compare
This comment has been minimized.
This comment has been minimized.
f4bf27e
to
0fc9425
Compare
This comment has been minimized.
This comment has been minimized.
0fc9425
to
7dfec1c
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
be68310
to
f3f83ae
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
7757a8e
to
476c99a
Compare
This comment has been minimized.
This comment has been minimized.
comp/otelcol/otlp/components/exporter/datadogexporter/traces_exporter.go
Show resolved
Hide resolved
comp/otelcol/otlp/components/exporter/datadogexporter/factory.go
Outdated
Show resolved
Hide resolved
comp/otelcol/otlp/components/exporter/datadogexporter/traces_exporter.go
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
comp/trace/agent/def/component.go
Outdated
type Component interface{} | ||
type Component interface { | ||
// SetStatsdClient sets the stats client of the underlying trace agent. | ||
SetStatsdClient(mclient statsd.ClientInterface) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a way to hack around FX's statsd by allowing us to change the Agent's internal state after startup. This is fragile and probably does not work as expected, given how many places we send the statsd client to on Agent construction:
datadog-agent/pkg/trace/agent/agent.go
Lines 104 to 139 in 24d2617
func NewAgent(ctx context.Context, conf *config.AgentConfig, telemetryCollector telemetry.TelemetryCollector, statsd statsd.ClientInterface) *Agent { | |
dynConf := sampler.NewDynamicConfig() | |
log.Infof("Starting Agent with processor trace buffer of size %d", conf.TraceBuffer) | |
in := make(chan *api.Payload, conf.TraceBuffer) | |
statsChan := make(chan *pb.StatsPayload, 1) | |
oconf := conf.Obfuscation.Export(conf) | |
if oconf.Statsd == nil { | |
oconf.Statsd = statsd | |
} | |
timing := timing.New(statsd) | |
agnt := &Agent{ | |
Concentrator: stats.NewConcentrator(conf, statsChan, time.Now(), statsd), | |
ClientStatsAggregator: stats.NewClientStatsAggregator(conf, statsChan, statsd), | |
Blacklister: filters.NewBlacklister(conf.Ignore["resource"]), | |
Replacer: filters.NewReplacer(conf.ReplaceTags), | |
PrioritySampler: sampler.NewPrioritySampler(conf, dynConf, statsd), | |
ErrorsSampler: sampler.NewErrorsSampler(conf, statsd), | |
RareSampler: sampler.NewRareSampler(conf, statsd), | |
NoPrioritySampler: sampler.NewNoPrioritySampler(conf, statsd), | |
ProbabilisticSampler: sampler.NewProbabilisticSampler(conf, statsd), | |
EventProcessor: newEventProcessor(conf, statsd), | |
StatsWriter: writer.NewStatsWriter(conf, statsChan, telemetryCollector, statsd, timing), | |
obfuscator: obfuscate.NewObfuscator(oconf), | |
In: in, | |
conf: conf, | |
ctx: ctx, | |
DebugServer: api.NewDebugServer(conf), | |
Statsd: statsd, | |
Timing: timing, | |
} | |
agnt.Receiver = api.NewHTTPReceiver(conf, dynConf, in, agnt, telemetryCollector, statsd, timing) | |
agnt.OTLPReceiver = api.NewOTLPReceiver(in, conf, statsd, timing) | |
agnt.RemoteConfigHandler = remoteconfighandler.New(conf, agnt.PrioritySampler, agnt.RareSampler, agnt.ErrorsSampler) | |
agnt.TraceWriter = writer.NewTraceWriter(conf, agnt.PrioritySampler, agnt.ErrorsSampler, agnt.RareSampler, telemetryCollector, statsd, timing) | |
return agnt | |
} |
We should instead either modify the statsd.Component
to support creating your custom statsd or create a new implementation of statsd.Component
that you can use to provide your client to the agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually you're right that we shouldn't set the global statsd client here, but rather the one used in OTLP receiver. Fixed that.
For statsd.Component
implementation - we cannot do it in this way, because the stats client that we want to supply is created after collector is started. The start sequence is fx -> collector -> collector internal telemetry -> statsd client for OTel. In fx we do not have the reference to the statsd client for OTel, thus it has to be set dynamically (like we discussed in the previous meeting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, it turns out we need to set the stats client used in the trace agent everywhere.
I'll set up some time to discuss. Doesn't seem straightforward to update the statsd reference everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I'm discussing this within the team too.
This comment has been minimized.
This comment has been minimized.
comp/otelcol/otlp/components/exporter/datadogexporter/factory.go
Outdated
Show resolved
Hide resolved
comp/trace/agent/def/component.go
Outdated
type Component interface{} | ||
type Component interface { | ||
// SetStatsdClient sets the stats client of the underlying trace agent. | ||
SetStatsdClient(mclient statsd.ClientInterface) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I'm discussing this within the team too.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me.
The SetOTelAttributeTranslator
method feels off for the same reason as the statsd one, but it's only affecting the otlp receiver, so blast radius for it is minimal. Maybe this can be improved later.
// SetOTelAttributeTranslator sets the OTel attributes translator of the underlying trace agent. | ||
SetOTelAttributeTranslator(attrstrans *attributes.Translator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels similarly hacky to the statsd setter. Ideally this could be handled without mutating the agent like we do with statsd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dineshg13 and I talked a bit about the attributes translator before. This exists for a legacy reason and is not well structured right now, e.g. the attributes translator is in traceagent.config
struct while it is not really a config. We have a backlog ticket to refactor & improve the OTLP receiver in trace agent in general, and this will be part of that effort.
/merge |
🚂 MergeQueue: pull request added to the queue The median merge time in Use |
❌ MergeQueue: The build pipeline contains failing jobs for this merge request Build pipeline has failing jobs for 755ac2f Since those jobs are not marked as being allowed to fail, the pipeline will most likely fail. You should have a look at the pipeline, wait for the build to finish and investigate the failures.
|
This comment has been minimized.
This comment has been minimized.
/merge |
🚂 MergeQueue: waiting for PR to be ready This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals. Use |
🚂 MergeQueue: pull request added to the queue The median merge time in Use |
What does this PR do?
Add trace exporter implementation to Datadog exporter in OTel agent.
Trace agent internal metrics are missing for now, will be added in a follow-up PR.
Motivation
Support trace ingestion in OTel Agent
Describe how to test/QA your changes
Build OTel agent and start the agent with Datadog exporter in the trace pipeline. Then verify traces are sent to DD.